Linear Models for Classification

The goal in classification is to take an input vector \(\vec{x}\) and assign it to one of \(K\) discrete classes \(\mathcal{C}_k\), where \(k = 1,\dots,K\). The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. Data sets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable.

In Chapter 1, we identified three distinct approaches to the classification problem. The simplest involves constructing a discriminant function that directly assigns each vector \(\vec{x}\) to a specific class. A more powerful approach, however, models the conditional probability distribution \(p(\mathcal{C}_k|\vec{x})\) in an inference stage, and then subsequently uses this distribution to make optimal decisions. By separating inference and decision, we gain numerous benefits, as discussed in [chapter:Introduction]. There are two different approaches to determining the conditional probabilities \(p(\mathcal{C}_k|\vec{x})\). One technique is to model them directly, for example by representing them as parametric models and then optimizing the parameters using a training set. Alternatively, we can adopt a generative approach in which we model the class-conditional densities \(p(\vec{x}|\mathcal{C}_k)\), together with the prior probabilities \(p(\mathcal{C}_k)\) for the classes, and then compute the required posterior probabilities using Bayes' theorem \[p(\mathcal{C}_k|\vec{x}) = \dfrac{p(\vec{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{p(\vec{x})}\]

In the linear regression models considered in Chapter 3, the model prediction \(y(\vec{x},\vec{w})\) was given by a linear function of the parameters \(\vec{w}\). In the simplest case, the model is also linear in the input variables and therefore takes the form \(y(\vec{x}) = \vec{w}^T\vec{x}+w_0\), so that \(y\) is a real number. For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range \((0, 1)\). To achieve this, we consider a generalization of this model in which we transform the linear function of \(\vec{w}\) using a nonlinear function \(f(\cdot)\) so that \[y(\vec{x}) = f(\vec{w}^T\vec{x} + w_0)\] In the machine learning literature \(f(\cdot)\) is known as an activation function, whereas its inverse is called a link function in the statistics literature. The decision surfaces correspond to \(y(\vec{x}) = \text{constant}\), so that \(\vec{w}^T\vec{x} + w_0 = \text{constant}\), and hence the decision surfaces are linear functions of \(\vec{x}\), even if the function \(f(\cdot)\) is nonlinear. For this reason, the class of models described by this form is known as generalized linear models (McCullagh and Nelder, 1989). However, in contrast to the models used for regression, they are no longer linear in the parameters due to the presence of the nonlinear function \(f(\cdot)\).

Discriminant Functions

A discriminant is a function that takes an input vector \(\vec{x}\) and assigns it to one of \(K\) classes, denoted \(\mathcal{C}_k\).

Two classes

The linear discriminant function is \[\begin{aligned} y(\vec{x}) = \vec{w}^T\vec{x}+w_0\end{aligned}\] where \(\vec{w}\) is called a weight vector and \(w_0\) is a bias. The negative of the bias is sometimes called a threshold.

The value of \(y(\vec{x})\) gives a signed measure of the perpendicular distance \(r\) of the point \(\vec{x}\) from the decision surface. Consider an arbitrary point \(\vec{x}\) and let \(\vec{x}_\perp\) be its orthogonal projection onto the decision surface, then \[\begin{aligned} \vec{x} &= \vec{x}_\perp + r\dfrac{\vec{w}}{\parallel\vec{w}\parallel} \\\end{aligned}\] Multiplying both sides by \(\vec{w}^T\), adding \(w_0\), and noting that \(y(\vec{x}_\perp) = 0\) since \(\vec{x}_\perp\) lies on the decision surface, \[\begin{aligned} \label{eqn:distance from point to hyperplane} y(\vec{x}) &= y(\vec{x}_{\perp}) + r\dfrac{\vec{w}^T\vec{w}}{\parallel\vec{w}\parallel} \\ &= r\parallel\vec{w}\parallel\\ \therefore r &= \dfrac{y(\vec{x})}{\parallel\vec{w}\parallel}\end{aligned}\]
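As a concrete illustration (a minimal numpy sketch with made-up weights and points, not from the text), the signed distance follows directly from this result:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = y(x) / ||w|| of x from the
    decision surface defined by y(x) = w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

# example: a 2-D hyperplane (a line) with normal vector w and bias w0
w, w0 = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([1.0, 1.0]), w, w0))  # 0.4, positive side
print(signed_distance(np.array([0.0, 0.0]), w, w0))  # -1.0, negative side
```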

Multiple classes

Consider the extension of linear discriminants to \(K>2\) classes. Building a \(K\)-class discriminant by combining a number of two-class discriminant functions leads to some serious difficulties (Duda and Hart, 1973).

Consider the use of \(K-1\) classifiers, each of which solves a two-class problem of separating points in a particular class \(\mathcal{C}_k\) from points not in that class. This is known as a one-versus-the-rest classifier. An alternative is to introduce \(K(K-1)/2\) binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Both of these approaches run into the problem of ambiguous regions. We can avoid these difficulties by considering a single \(K\)-class discriminant comprising \(K\) linear functions of the form \[y_k(\vec{x}) = \vec{w_k}^T\vec{x}+w_{k0}\] and then assigning a point \(\vec{x}\) to class \(\mathcal{C}_k\) if \(y_k(\vec{x}) > y_j(\vec{x})\) for all \(j \neq k\). The decision boundary between class \(\mathcal{C}_k\) and \(\mathcal{C}_j\) is therefore given by \(y_k(\vec{x}) = y_j(\vec{x})\) and hence corresponds to a \((D-1)\)-dimensional hyperplane defined by \[(\vec{w_k}-\vec{w_j})^T\vec{x} + (w_{k0}-w_{j0}) = 0\] The decision regions of such a discriminant are always singly connected and convex.
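A minimal sketch of such a single \(K\)-class discriminant (illustrative numpy code; the function name and the shapes chosen for `W` and `w0` are assumptions for the example):

```python
import numpy as np

def classify(X, W, w0):
    """Assign each row of X to the class k maximizing y_k(x) = w_k^T x + w_k0.

    X : (N, D) data, W : (K, D) weight vectors, w0 : (K,) biases.
    """
    scores = X @ W.T + w0          # (N, K) matrix of y_k(x_n)
    return np.argmax(scores, axis=1)
```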

Least squares for classification

Even when used purely as a discriminant function (where we use it to make decisions directly and dispense with any probabilistic interpretation), least squares suffers from some severe problems. Least-squares solutions lack robustness to outliers, and this applies equally to the classification setting: additional points (outliers) can produce a significant change in the location of the decision boundary. This is because the sum-of-squares error function penalizes predictions that are 'too correct', in that they lie a long way on the correct side of the decision boundary. Beyond this lack of robustness, least-squares solutions can give poor results on classification problems.

The failure of least squares lies in the fact that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian (it is Bernoulli).

Fisher’s linear discriminant

representation

One way to view a linear classification model is in terms of dimensionality reduction. Consider first the case of two classes, and suppose we take the \(D\)-dimensional input vector \(\vec{x}\) and project it down to one dimension using \[\label{eqn:Fisher LDA projection} y = \vec{w}^T\vec{x}.\] If we place a threshold on \(y\) and classify \(y \geq -w_0\) as class \(\mathcal{C_1}\), and otherwise class \(\mathcal{C_2}\), then we obtain our standard linear classifier discussed in the previous section.

evaluation

By adjusting the components of the weight vector \(\vec{w}\), we can select a projection that maximizes the class separation. To begin with, consider a two-class problem in which there are \(N_1\) points of class \(\mathcal{C_1}\) and \(N_2\) points of class \(\mathcal{C_2}\), so that the mean vectors of the two classes are given by \[\vec{m_1} = \dfrac{1}{N_1} \sum_{n\in \mathcal{C_1}}{\vec{x_n}},\qquad \vec{m_2} = \dfrac{1}{N_2}\sum_{n\in \mathcal{C_2}}\vec{x_n}\] The simplest measure of the separation of the classes, when projected onto \(\vec{w}\), is the separation of the projected class means. This suggests that we might choose \(\vec{w}\) so as to maximize \[m_2 - m_1 = \vec{w}^T(\vec{m_2}-\vec{m_1})\] where \[m_k = \vec{w}^T\vec{m_k}\] is the mean of the projected data from class \(\mathcal{C_k}\). However, this expression can be made arbitrarily large simply by increasing the magnitude of \(\vec{w}\). To solve this, we could constrain \(\vec{w}\) to have unit length, so that \(\sum_{i}{w_i^2}=1\). Using a Lagrange multiplier to perform the constrained maximization, we then find that \(\vec{w} \propto (\vec{m_2} - \vec{m_1})\). There is still a problem with this approach: the projected data can have considerable overlap when the class distributions have strongly nondiagonal covariances.

optimization

The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection formula [eqn:Fisher LDA projection] transforms the set of labelled data points in \(\vec{x}\) into a labelled set in the one-dimensional space \(y\). The within-class variance of the transformed data from class \(\mathcal{C_k}\) is therefore given by \[s_k^2 = \sum_{n\in \mathcal{C_k}}(y_n-m_k)^2\] where \[y_n = \vec{w}^T\vec{x_n}\] We can define the total within-class variance for the whole data set to be simply \(s_1^2+s_2^2\). The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by \[J(\vec{w}) = \dfrac{(m_2-m_1)^2}{s_1^2+s_2^2}\]

We can make the dependence on \(\vec{w}\) explicit by rewriting the Fisher criterion in the form \[\label{eqn:Fisher criterion} J(\vec{w}) = \dfrac{\vec{w}^T\vec{S_B}\vec{w}}{\vec{w}^T\vec{S_W}\vec{w}}\] where \(\vec{S_B}\) is the between-class covariance matrix, given by \[\label{eqn:Fisher between-class covariance matrix} \vec{S_B} = (\vec{m_2}-\vec{m_1})(\vec{m_2}-\vec{m_1})^T\] so that \[\vec{w}^T\vec{S_B}\vec{w} = \vec{w}^T(\vec{m_2} - \vec{m_1})(\vec{m_2}-\vec{m_1})^T\vec{w} = (m_2-m_1)^2\] and \(\vec{S_W}\) is the total within-class covariance matrix, given by \[\vec{S_W} = \sum_{k}\sum_{n\in \mathcal{C}_k}(\vec{x_n} - \vec{m_k})(\vec{x_n}-\vec{m_k})^T\] so that \[\vec{w}^T\vec{S_W}\vec{w} = \sum_{k}\sum_{n\in \mathcal{C}_k}{\vec{w}^T(\vec{x_n}-\vec{m_k})(\vec{x_n}-\vec{m_k})^T\vec{w}} = \sum_{k}\sum_{n \in \mathcal{C}_k}(y_n-m_k)^2\] Differentiating [eqn:Fisher criterion] with respect to \(\vec{w}\), we find that \(J(\vec{w})\) is maximized when \[{(\vec{w}^T\vec{S_B}\vec{w})\vec{S_W}\vec{w}} = (\vec{w}^T\vec{S_W}\vec{w})\vec{S_B}\vec{w}\] Rewriting this as \[\label{eqn:Fisher generalized eigenvalue} \vec{S_B}\vec{w} = \lambda\vec{S_W}\vec{w}\] where \[\lambda = \dfrac{\vec{w}^T\vec{S_B}\vec{w}}{\vec{w}^T\vec{S_W}\vec{w}}\] we see that this is a generalized eigenvalue problem.

\[\begin{aligned} \because\begin{cases} \dfrac{\partial}{\partial x}\dfrac{f(x)}{g(x)} = \dfrac{f'g-fg'}{g^2}, \text{where } f^\prime = \dfrac{\partial f(x)}{\partial x} \text{ and } g' = \dfrac{\partial g(x)}{\partial x} \\ \dfrac{\partial}{\partial \vec{x}} \vec{x}^T\vec{A}\vec{x} = (\vec{A}+\vec{A}^T)\vec{x} \end{cases} \\ \therefore\end{aligned}\]

\[\begin{aligned} \dfrac{\partial}{\partial \vec{w}}J(\vec{w}) & = \dfrac{2\vec{S_B}\vec{w}(\vec{w}^T\vec{S_W}\vec{w})-2(\vec{w}^T\vec{S_B}\vec{w})\vec{S_W}\vec{w}}{(\vec{w}^T\vec{S_W}\vec{w})^2}\end{aligned}\] using the symmetry of \(\vec{S_B}\) and \(\vec{S_W}\).

Setting the derivative to zero and cancelling the common factor of 2 recovers the stationarity condition above.

From [eqn:Fisher between-class covariance matrix] we see that \(\vec{S_B}\vec{w}\) is always in the direction of \((\vec{m_2}-\vec{m_1})\), since \[\vec{S_B}\vec{w} = (\vec{m_2}-\vec{m_1})(\vec{m_2}-\vec{m_1})^T\vec{w} = (\vec{m_2}-\vec{m_1})(m_2-m_1)\] where \((m_2-m_1) = (\vec{m_2}-\vec{m_1})^T\vec{w}\) is a scalar. Multiplying both sides of [eqn:Fisher generalized eigenvalue] by \(\vec{S_W}^{-1}\) and dropping the scalar factors, which do not affect the direction of \(\vec{w}\), we then obtain \[\label{eqn:Fisher's linear discriminant} \vec{w} \propto \vec{S_W}^{-1}(\vec{m_2}-\vec{m_1})\] Note that if the within-class covariance is isotropic, so that \(\vec{S_W}\) is proportional to the unit matrix (\(\vec{S_W} \propto \vec{I}\)), we find that \(\vec{w}\) is proportional to the difference vector of the class means.

The result [eqn:Fisher's linear discriminant] is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projection of the data down to one dimension. However, the projected data can subsequently be used to construct a discriminant, by choosing a threshold \(y_0\) so that we classify a new point as belonging to \(\mathcal{C}_1\) if \(y(\vec{x}) \geq y_0\) and to \(\mathcal{C}_2\) otherwise.
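A minimal numpy sketch of this two-class procedure, following the formulas above; the midpoint-of-projected-means threshold used here is one simple choice, not prescribed by the text:

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher direction w proportional to S_W^{-1}(m2 - m1) for two classes.

    X1 : (N1, D) samples of class C1, X2 : (N2, D) samples of class C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # solve S_W w = m2 - m1, avoiding an explicit inverse
    y0 = 0.5 * (w @ m1 + w @ m2)        # threshold: midpoint of the projected class means
    # with this sign convention, points with w @ x > y0 fall on the C2 side of the threshold
    return w, y0
```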

Extensions to higher dimensions and multiple classes

We can extend the above idea to multiple classes, and to higher-dimensional subspaces, by finding a projection matrix \(\vec{W}\) which maps from \(D\) dimensions to \(L\) so as to maximize \[J(\vec{W}) = \dfrac{|\vec{W}\vec{\Sigma_B}\vec{W}^T|}{|\vec{W}\Sigma_W\vec{W}^T|}\] where \[\begin{aligned} \Sigma_B \triangleq \sum_{c}{\dfrac{N_c}{N}(\vec{m_c}-\vec{m})(\vec{m_c}-\vec{m})^T} \\ \Sigma_W \triangleq \sum_{c}\dfrac{N_c}{N}\Sigma_c \\ \Sigma_c \triangleq \sum_{i:y_i=c}(\vec{x_i}-\vec{m_c})(\vec{x_i}-\vec{m_c})^T\end{aligned}\] The solution can be shown to be \[\vec{W} = \Sigma_W^{-\frac{1}{2}}\vec{U}\] where \(\vec{U}\) contains the \(L\) leading eigenvectors of \(\Sigma_W^{-\frac{1}{2}}\Sigma_B\Sigma_W^{-\frac{1}{2}}\), assuming \(\Sigma_W\) is non-singular. (If it is singular, we can first perform PCA on all the data.)
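A sketch of this multi-class construction under the stated non-singularity assumption (illustrative numpy code; it follows the definitions above literally, and the convention that the returned matrix has \(L\) rows, so that the projection is \(\vec{z} = \vec{W}\vec{x}\), is an assumption made for the example):

```python
import numpy as np

def multiclass_lda(X, y, L):
    """Projection maximizing |W Sigma_B W^T| / |W Sigma_W W^T|, returned as an (L, D) matrix."""
    N, D = X.shape
    m = X.mean(axis=0)
    Sigma_B = np.zeros((D, D))
    Sigma_W = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sigma_B += (len(Xc) / N) * np.outer(mc - m, mc - m)
        Sigma_W += (len(Xc) / N) * (Xc - mc).T @ (Xc - mc)
    # Sigma_W^{-1/2} via its eigendecomposition (assumes Sigma_W is non-singular)
    evals, evecs = np.linalg.eigh(Sigma_W)
    Sigma_W_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    M = Sigma_W_inv_sqrt @ Sigma_B @ Sigma_W_inv_sqrt
    # L leading eigenvectors of the symmetric matrix M (eigh returns ascending order)
    _, U = np.linalg.eigh(M)
    U = U[:, ::-1][:, :L]
    return (Sigma_W_inv_sqrt @ U).T
```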

Probabilistic interpretation of FLDA *

Relation to least squares

Fisher’s discriminant for multiple classes

The perceptron algorithm

Another example of a linear discriminant model is the perceptron of Rosenblatt (1962), which corresponds to a two-class model in which the input vector \(\vec{x}\) is first transformed using a fixed nonlinear transformation to give a feature vector \(\phi(\vec{x})\), and this is then used to construct a generalized linear model of the form \[y(\vec{x}) = f(\vec{w}^T\vec{\phi}(\vec{x}))\] where the nonlinear activation function \(f(\cdot)\) is given by a step function of the form \[f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}\]

error function

A natural choice of error function would be the total number of misclassified patterns, but this does not lead to a simple learning algorithm because the error is a piecewise constant function of \(\vec{w}\), with discontinuities wherever a change in \(\vec{w}\) causes the decision boundary to move across one of the data points, so gradient-based methods cannot be applied. An alternative error function is known as the perceptron criterion, given by \[E_P(\vec{w}) = -\sum_{n\in \mathcal{M}}{\vec{w}^T\vec{\phi}_n t_n}\] where \(\vec{\phi}_n = \vec{\phi}(\vec{x}_n)\), we use the \(t\in \{-1,+1\}\) coding scheme, and \(\mathcal{M}\) denotes the set of all misclassified patterns. The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern \(\vec{x}_n\) it tries to minimize the quantity \(-\vec{w}^T\vec{\phi}_n t_n\). The perceptron learning rule is not, however, guaranteed to reduce the total error function at each stage.

However, the perceptron convergence theorem states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.

Aside from difficulties with the learning algorithm, the perceptron does not provide probabilistic outputs, nor does it generalize readily to \(K>2\) classes. The most important limitation, however, arises from the fact that it is based on linear combinations of fixed basis functions.

optimization

We now apply the stochastic gradient descent algorithm to this error function. The change in the weight vector \(\vec{w}\) is then given by \[\vec{w}^{\tau+1} = \vec{w}^{\tau} - \eta\nabla E_P(\vec{w}) = \vec{w}^{\tau} + \eta\vec{\phi}_n t_n\] where \(\eta\) is the learning rate parameter and \(\tau\) is an integer that indexes the steps of the algorithm. Because the perceptron function \(y(\vec{x},\vec{w})\) is unchanged if we multiply \(\vec{w}\) by a constant, we can set the learning rate \(\eta\) equal to 1 without loss of generality.
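A minimal numpy sketch of the resulting learning algorithm with \(\eta = 1\) (illustrative; here the feature map is simply the raw input plus a constant bias feature, an assumption made for the example):

```python
import numpy as np

def perceptron_train(X, t, n_epochs=100):
    """Perceptron updates w <- w + phi_n * t_n for misclassified points.

    X : (N, D) inputs, t : (N,) targets coded in {-1, +1}.
    """
    Phi = np.hstack([np.ones((len(X), 1)), X])   # phi_0(x) = 1 plays the role of the bias
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            if np.sign(w @ phi_n) != t_n:        # misclassified (sign(0) is treated as wrong)
                w += phi_n * t_n                 # learning rate eta = 1
                updated = True
        if not updated:                          # all points correctly classified: converged
            break
    return w
```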

Probabilistic Generative Models

We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data. We adopt a generative approach in which we model the class-conditional densities \(p(\vec{x}|\mathcal{C}_k)\), as well as the class priors \(p(\mathcal{C}_k)\), and then use these to compute posterior probabilities \(p(\mathcal{C}_k|\vec{x})\) through Bayes' theorem.

Two classes

Consider first of all the case of two classes. The posterior probability for class \(\mathcal{C}_1\) can be written as \[\begin{aligned} p(\mathcal{C}_1|\vec{x}) &= \dfrac{p(\vec{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\vec{x}|\mathcal{C}_1)p(\mathcal{C}_1)+p(\vec{x}|\mathcal{C}_2)p(\mathcal{C}_2)} \\ &= \dfrac{1}{1+\exp(-a)} = \sigma(a)\end{aligned}\] where we have defined \[a = \ln \dfrac{p(\vec{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\vec{x}|\mathcal{C}_2)p(\mathcal{C}_2)}\] and \(\sigma(a)\) is the logistic sigmoid function defined by \[\sigma(a) = \dfrac{1}{1+\exp(-a)}\] The term 'sigmoid' means S-shaped; the function is sometimes called a 'squashing function' because it maps the whole real axis into a finite interval. It satisfies the symmetry property \[\sigma(-a) = 1 - \sigma(a)\] The inverse of the logistic sigmoid is given by \[a = \ln\left(\dfrac{\sigma}{1-\sigma}\right)\] and is known as the logit function. It represents the log of the ratio of probabilities \(\ln[p(\mathcal{C}_1|\vec{x})/p(\mathcal{C}_2|\vec{x})]\) for the two classes, also known as the log odds.

Multiple classes

For the case of \(K>2\) classes, we have the posterior \[\begin{aligned} p(\mathcal{C}_k|\vec{x}) &=\dfrac{p(\vec{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{\sum_{j}{p(\vec{x}|\mathcal{C}_j)p(\mathcal{C}_j)}} \\ &=\dfrac{\exp(a_k)}{\sum_j\exp(a_j)}\end{aligned}\] which is known as the normalized exponential and can be regarded as a multiclass generalization of the logistic sigmoid. Here the quantities \(a_k\) are defined by \[a_k=\ln p(\vec{x}|\mathcal{C}_k)p(\mathcal{C}_k)\] The normalized exponential is also known as the softmax function, as it represents a smoothed version of the 'max' function: if \(a_k \gg a_j\) for all \(j\neq k\), then \(p(\mathcal{C}_k|\vec{x}) \simeq 1\) and \(p(\mathcal{C}_j|\vec{x}) \simeq 0\).

We now investigate the consequences of choosing specific forms for the class-conditional densities, considering both continuous and discrete inputs.

Continuous inputs

Assume that the class-conditional densities are Gaussian, sharing the same covariance matrix, and then explore the resulting form for the posterior probabilities. The density for class \(\mathcal{C}_k\) is thus given by \[p(\vec{x}|\mathcal{C}_k) = \dfrac{1}{(2\pi)^{D/2}}\dfrac{1}{|\vec{\Sigma}|^{1/2}} \exp\{-\dfrac{1}{2}(\vec{x}-\vec{\mu}_k)^T\vec{\Sigma}^{-1}(\vec{x}-\vec{\mu}_k) \}\] In the case of two classes, we have \[\begin{aligned} a(\vec{x}) &= \log\dfrac{p(\vec{x}|\mathcal{C}_1)p(\mathcal{C}_1)} {p(\vec{x}|\mathcal{C}_2)p(\mathcal{C}_2)} \\ &=-\dfrac{1}{2}(\vec{x}-\vec{\mu}_1)^T\vec{\Sigma}^{-1}(\vec{x}-\vec{\mu}_1) + \dfrac{1}{2}(\vec{x}-\vec{\mu}_2)^T\vec{\Sigma}^{-1}(\vec{x}-\vec{\mu}_2) + \log\dfrac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \\ &=(\vec{\mu_1}-\vec{\mu_2})^T\vec{\Sigma}^{-1}\vec{x}- \dfrac{1}{2}\vec{\mu_1}^T\vec{\Sigma}^{-1}\vec{\mu_1} + \dfrac{1}{2}\vec{\mu_2}^T\vec{\Sigma}^{-1}\vec{\mu_2} + \log\dfrac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \\ &=\vec{w}^T\vec{x}+w_0\end{aligned}\] where \[\begin{aligned} \vec{w} &= \vec{\Sigma}^{-1}(\vec{\mu_1}-\vec{\mu_2}) \\ w_0 &= -\dfrac{1}{2}\vec{\mu_1}^T\vec{\Sigma}^{-1}\vec{\mu_1} + \dfrac{1}{2}\vec{\mu_2}^T\vec{\Sigma}^{-1}\vec{\mu_2} + \log\dfrac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}\end{aligned}\] We see that the quadratic terms in \(\vec{x}\) from the exponents of the Gaussian densities have cancelled (due to the common covariance matrix), leading to a linear function of \(\vec{x}\) in the argument of the logistic sigmoid. The priors \(p(\mathcal{C}_k)\) enter only through the bias parameter \(w_0\), so changing the priors shifts the decision boundary parallel to itself, and more generally shifts the parallel contours of constant posterior probability.

For the general case of \(K\) classes we have \[\begin{aligned} a_k(\vec{x}) &= \log p(\vec{x}|\mathcal{C}_k)p(\mathcal{C}_k) \\ &=\vec{\mu_k}^T\vec{\Sigma}^{-1}\vec{x} - \dfrac{1}{2}{\vec{\mu_k}}^T\vec{\Sigma}^{-1}\vec{\mu_k} - \dfrac{1}{2}\vec{x}^T\vec{\Sigma}^{-1}\vec{x}+ \log p(\mathcal{C}_k) + const\end{aligned}\] Notice that the term \(-\frac{1}{2}\vec{x}^T\vec{\Sigma}^{-1}\vec{x}\) is the same for every class and therefore cancels in the softmax function. Therefore we can write \[a_k(\vec{x}) = \vec{w_k}^T\vec{x}+w_{k0}\] where \[\begin{aligned} \vec{w_k} &= \vec{\Sigma}^{-1}\vec{\mu_k} \\ w_{k0} &= -\dfrac{1}{2}\vec{\mu_k}^T\vec{\Sigma}^{-1}\vec{\mu_k} + \ln p(\mathcal{C}_k)\end{aligned}\] We see that the \(a_k(\vec{x})\) are again linear functions of \(\vec{x}\) as a consequence of the shared covariance matrix. If instead each class has its own covariance matrix \(\vec{\Sigma}_k\), the cancellations no longer occur and we obtain quadratic functions of \(\vec{x}\), giving rise to a quadratic discriminant.
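A small numpy sketch that evaluates these posteriors from given class means, shared covariance and priors (illustrative code; the max-subtraction before exponentiating is a standard numerical-stability device, not part of the derivation):

```python
import numpy as np

def gaussian_generative_posteriors(X, mus, Sigma, priors):
    """Posterior p(C_k|x) via softmax over a_k(x) = w_k^T x + w_k0.

    X : (N, D) inputs, mus : (K, D) class means,
    Sigma : (D, D) shared covariance, priors : (K,) class priors.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    W = mus @ Sigma_inv                                   # rows are w_k^T = mu_k^T Sigma^{-1}
    w0 = -0.5 * np.sum(mus @ Sigma_inv * mus, axis=1) + np.log(priors)
    A = X @ W.T + w0                                      # (N, K) activations a_k(x_n)
    A -= A.max(axis=1, keepdims=True)                     # stabilize the softmax
    P = np.exp(A)
    return P / P.sum(axis=1, keepdims=True)
```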

Maximum likelihood solution

Having specified a parametric functional form for the class-conditional densities \(p(\vec{x}|\mathcal{C}_k)\),we can then determine the values of the parameters,together with the prior class probabilities \(p(\mathcal{C}_k)\),using maximum likelihood.

Consider first the case of two classes, each having a Gaussian class-conditional density with a shared covariance matrix, and suppose we have a data set \(\{\vec{x_n},t_n\}\) where \(n=1,\dots,N\). Here \(t_n=1\) denotes class \(\mathcal{C}_1\) and \(t_n=0\) denotes \(\mathcal{C}_2\). Denote the prior class probability \(p(\mathcal{C}_1) = \pi\), so that \(p(\mathcal{C}_2) = 1- \pi\). For a data point \(\vec{x_n}\) from class \(\mathcal{C}_1\), we have \(t_n=1\) and hence \[p(\vec{x_n},\mathcal{C}_1) = p(\mathcal{C}_1)p(\vec{x_n}|\mathcal{C}_1) = \pi \mathcal{N}(\vec{x_n}|\vec{\mu_1},\vec{\Sigma})\] Similarly, for a data point from class \(\mathcal{C}_2\), we have \(t_n=0\) and hence \[p(\vec{x_n},\mathcal{C}_2) = p(\mathcal{C}_2)p(\vec{x_n}|\mathcal{C}_2) = (1-\pi) \mathcal{N}(\vec{x_n}|\vec{\mu_2},\vec{\Sigma})\] Thus the likelihood is given by \[p(\vec{t}|\pi,\vec{\mu_1},\vec{\mu_2},\vec{\Sigma}) = \prod_{n=1}^{N} [\pi \mathcal{N}(\vec{x_n}|\vec{\mu_1},\vec{\Sigma})]^{t_n} [(1-\pi)\mathcal{N}(\vec{x_n}|\vec{\mu_2},\vec{\Sigma})]^{1-t_n}\] where \(\vec{t} = (t_1,...,t_N)^T\). We maximize the log likelihood \[\log p(\vec{t}|\pi,\vec{\mu_1},\vec{\mu_2},\vec{\Sigma}) = \sum_{n=1}^{N}\{t_n\log \pi + t_n\log \mathcal{N}(\vec{x_n}|\vec{\mu_1},\vec{\Sigma})+ (1-t_n)\log(1-\pi) + (1-t_n)\log \mathcal{N}(\vec{x_n}|\vec{\mu_2},\vec{\Sigma}) \}\] The terms that depend on \(\pi\) are \[\sum_{n=1}^{N}\{t_n\log\pi + (1-t_n)\log(1-\pi) \}\] Setting the derivative with respect to \(\pi\) equal to zero and rearranging, we obtain \[\begin{aligned} &\sum_{n=1}^{N}\{t_n\dfrac{1}{\pi} +(t_n-1)\dfrac{1}{1-\pi} \} = 0 \\ \Rightarrow & \pi\sum_{n=1}^{N}(t_n-1) -\pi\sum_{n=1}^{N}t_n + \sum_{n=1}^{N}t_n = 0 \\ \Rightarrow & \pi = \dfrac{1}{N}\sum_{n=1}^{N}t_n = \dfrac{N_1}{N}=\dfrac{N_1}{N_1+N_2}\end{aligned}\] where \(N_k\) denotes the total number of data points in class \(\mathcal{C}_k\). Thus the maximum likelihood estimate for \(\pi\) is simply the fraction of points in class \(\mathcal{C}_1\), as expected. This generalizes directly to the multiclass case, where each class prior is again the fraction of training points assigned to that class, \(\pi_k = N_k/N\).

Now consider the maximization with respect to \(\vec{\mu_1}\). The terms of the log likelihood that depend on \(\vec{\mu_1}\) are \[\sum_{n=1}^{N}t_n\ln\mathcal{N}(\vec{x_n}|\vec{\mu_1},\Sigma) = -\dfrac{1}{2}\sum_{n=1}^{N}t_n(\vec{x_n} - \vec{\mu_1})^T\vec{\Sigma}^{-1}(\vec{x_n}-\vec{\mu_1})+ const\] Setting the derivative with respect to \(\vec{\mu_1}\) to zero and rearranging, we obtain \[\vec{\mu_1} = \dfrac{1}{N_1}\sum_{n=1}^{N}t_n\vec{x_n}\] which is simply the mean of all the input vectors \(\vec{x_n}\) assigned to class \(\mathcal{C}_1\). Similarly, \[\vec{\mu_2} = \dfrac{1}{N_2}\sum_{n=1}^{N}(1-t_n)\vec{x_n}\] Finally, consider the maximum likelihood solution for the shared covariance matrix \(\Sigma\). The terms that depend on \(\Sigma\) are \[\begin{aligned} &-\dfrac{1}{2}\sum_{n=1}^{N}t_n\ln|\Sigma| - \dfrac{1}{2}\sum_{n=1}^{N}t_n(\vec{x_n} - \vec{\mu_1})^T\Sigma^{-1}(\vec{x_n}-\vec{\mu_1}) \\ &-\dfrac{1}{2}\sum_{n=1}^{N}(1-t_n)\ln|\Sigma| - \dfrac{1}{2}\sum_{n=1}^{N}(1-t_n)(\vec{x_n}-\vec{\mu_2})^T\Sigma^{-1}(\vec{x_n}-\vec{\mu_2})\\ &= -\dfrac{N}{2}\ln|\Sigma| -\dfrac{N}{2}Tr\{\Sigma^{-1}\vec{S}\} \end{aligned}\] where we have defined \[\begin{aligned} \vec{S} &= \dfrac{N_1}{N}\vec{S_1} + \dfrac{N_2}{N}\vec{S_2} \\ \vec{S_1}&= \dfrac{1}{N_1}\sum_{n\in \mathcal{C}_1}{(\vec{x_n}-\vec{\mu_1})(\vec{x_n}-\vec{\mu_1})^T} \\ \vec{S_2}&= \dfrac{1}{N_2}\sum_{n\in \mathcal{C}_2}{(\vec{x_n}-\vec{\mu_2})(\vec{x_n}-\vec{\mu_2})^T} \end{aligned}\] Using the standard result for the maximum likelihood solution of a Gaussian distribution, we see that \[\vec{\Sigma} = \vec{S}\] which represents a weighted average of the covariance matrices associated with each of the two classes separately.
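These maximum likelihood estimates translate directly into code (a sketch for the two-class case with the 0/1 coding of \(t\); function and variable names are chosen for the example):

```python
import numpy as np

def fit_shared_cov_gaussians(X, t):
    """ML estimates pi, mu1, mu2 and shared covariance S for two classes.

    X : (N, D) inputs, t : (N,) targets with t_n = 1 for C1 and 0 for C2.
    """
    t = np.asarray(t, dtype=float)
    N1, N2 = t.sum(), (1 - t).sum()
    pi = N1 / (N1 + N2)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    S = (N1 / (N1 + N2)) * S1 + (N2 / (N1 + N2)) * S2   # weighted average of class covariances
    return pi, mu1, mu2, S
```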

This result is easily extended to the \(K\)-class problem. Note that the approach of fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust.

Discrete features

Assume that the binary feature values \(x_i\in \{0,1\}\) are Bernoulli distributed and independent given the class (the naive Bayes assumption), so that we have class-conditional distributions of the form \[p(\vec{x}|\mathcal{C}_k) = \prod_{i=1}^{D}\mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}\] which contain \(D\) independent parameters for each class, giving \[a_k(\vec{x}) = \sum_{i=1}^{D}\{x_i\ln\mu_{ki}+(1-x_i)\ln(1-\mu_{ki})\} + \ln p(\mathcal{C}_k)\] which again are linear functions of the input values \(x_i\). Analogous results are obtained for discrete variables each of which can take \(M>2\) states.
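A one-line sketch of the corresponding activations for binary features (illustrative; \(\mu\) is assumed to be a \(K\times D\) matrix of Bernoulli parameters strictly between 0 and 1):

```python
import numpy as np

def bernoulli_activations(X, mu, priors):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k).

    X : (N, D) binary inputs, mu : (K, D) Bernoulli parameters, priors : (K,).
    """
    return X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(priors)
```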

Exponential family

As we have seen so far, for both Gaussian distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid (\(K=2\) classes) or softmax (\(K>2\) classes) activation functions. These are particular cases of a more general result obtained by assuming that the class-conditional densities \(p(\vec{x}|\mathcal{C}_k)\) are members of the exponential family of distributions. Using the form [sec:exponential-family] for members of the exponential family, we see that the distribution of \(\vec{x}\) can be written in the form \[p(\vec{x}|\vec{\lambda_k}) = h(\vec{x})g(\vec{\lambda}_k) \exp\{\vec{\lambda}_k^T\vec{u}(\vec{x}) \}\] We restrict attention to the subclass for which \(\vec{u}(\vec{x}) = \vec{x}\). Making use of [eqn:density scale parameter] to introduce a scaling parameter \(s\), we then have \[\label{eqn:restricted exponential family} p(\vec{x}|\vec{\lambda_k},s) = \dfrac{1}{s}h(\vec{x}/s)g(\vec{\lambda}_k) \exp\{\dfrac{1}{s}\vec{\lambda}_k^T\vec{x} \}\] Note that each class has its own parameter vector \(\vec{\lambda}_k\), but all classes share the same scale parameter \(s\).

For the two-class problem, the posterior is given by a logistic sigmoid acting on the linear function \[a(\vec{x}) = (\vec{\lambda}_1-\vec{\lambda}_2)^T\vec{x} + \ln g(\vec{\lambda}_1) - \ln g(\vec{\lambda}_2) +\ln p(\mathcal{C}_1)-\ln p(\mathcal{C}_2)\] Similarly, for the \(K\)-class problem, \[a_k(\vec{x}) = \vec{\lambda}_k^T\vec{x}+\ln g(\vec{\lambda}_k)+\ln p(\mathcal{C}_k)\] which is again a linear function of \(\vec{x}\).

Probabilistic Discriminative Models

We have seen that the posterior probabilities for the two-class and multiclass cases can be written as a logistic sigmoid and a softmax function, respectively, for a wide choice of class-conditional distributions \(p(\vec{x}|\mathcal{C}_k)\). For specific class-conditional densities \(p(\vec{x}|\mathcal{C}_k)\), we can use maximum likelihood to determine the parameters of the densities as well as the class priors \(p(\mathcal{C}_k)\), and then use Bayes' theorem to find the posterior class probabilities. However, an alternative approach is to use the functional form of the generalized linear model explicitly and to determine its parameters directly by maximum likelihood, for example using iterative reweighted least squares (IRLS).

The indirect approach of finding the parameters of a generalized linear model by fitting class-conditional densities and class priors separately and then applying Bayes' theorem represents an example of generative modelling, because we could take such a model and generate synthetic data by drawing values of \(\vec{x}\) from the marginal distribution \(p(\vec{x})\). In the direct approach, we maximize a likelihood function defined through the conditional distribution \(p(\mathcal{C}_k|\vec{x})\), which represents a form of discriminative training.

Fixed basis functions

So far, we have considered classification models that work directly with the original input vector \(\vec{x}\). However, all of the algorithms are equally applicable if we first make a fixed nonlinear transformation of the inputs using a vector of basis functions \(\phi(\vec{x})\). The resulting decision boundaries will be linear in the feature space \(\phi\), corresponding to nonlinear decision boundaries in the original \(\vec{x}\) space. One of the basis functions is typically set to a constant, say \(\phi_0(\vec{x})=1\), so that the corresponding parameter \(w_0\) plays the role of a bias.

Note that nonlinear transformations cannot remove class overlap between the class-conditional densities \(p(\vec{x}|\mathcal{C}_k)\).

Logistic regression

representation

We begin with two-class classification. The posterior probability of class \(\mathcal{C}_1\) can be written as a logistic sigmoid acting on a linear function of the feature vector \(\phi\), so that \[p(\mathcal{C}_1|\phi) = y(\phi) = \sigma(\vec{w}^T\phi)\] so the number of adjustable parameters grows only linearly with the dimensionality of the feature space. The logistic regression model can also be written as \[p(y|\vec{x},\vec{w})=\mathrm{Ber}(y|\mathrm{sigm}(\vec{w}^T\vec{\phi}))\] where \(\vec{w}\) and \(\vec{\phi}\) are extended vectors, i.e., \(\vec{w}=(b, w_1, w_2,\cdots, w_D)\), \(\vec{\phi}=(1, \phi_1, \phi_2,\cdots, \phi_D)\).

Note that the derivative of the logistic sigmoid function is \[\dfrac{d\sigma}{da} = \sigma(1-\sigma)\]

evaluation

For a data set \(\{\phi_n,t_n \}\), where \(t_n \in \{0,1\}\) and \(\phi_n = \phi(\vec{x_n})\), with \(n = 1,...,N\), the likelihood function can be written \[p(\vec{t}|\vec{w}) = \prod_{n=1}^{N}y_n^{t_n}\{1-y_n \}^{1-t_n}\] where \(\vec{t}=(t_1,...,t_N)^T\) and \(y_n = p(\mathcal{C}_1|\phi_n)\). The negative logarithm of the likelihood gives the cross-entropy error function \[E(\vec{w})=-\ln p(\vec{t}|\vec{w})=-\sum_{n=1}^{N}\{t_n\ln y_n +(1-t_n)\ln (1-y_n) \}\] Taking the gradient of the error function with respect to \(\vec{w}\), we obtain \[\begin{aligned} \nabla E(\vec{w}) &=-\sum\limits_{n=1}^{N}\{t_n\dfrac{y_n(1-y_n)}{y_n}- (1-t_n)\dfrac{y_n(1-y_n)}{1-y_n} \}\phi_n\\ &=-\sum_{n=1}^{N}\{t_n(1-y_n)-(1-t_n)y_n \}\phi_n\\ &=\sum_{n=1}^{N}(y_n-t_n)\phi_n\end{aligned}\] We see that the factor involving the derivative of the logistic sigmoid has cancelled. In particular, the contribution to the gradient from data point \(n\) is given by the 'error' \(y_n-t_n\) between the target value and the prediction of the model, times the basis function vector \(\phi_n\), taking the same form as the gradient of the sum-of-squares error function for the linear regression model.
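The gradient above is a single matrix expression in code (a sketch; `Phi` is the \(N\times M\) design matrix with rows \(\phi_n^T\)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_grad(w, Phi, t):
    """Gradient sum_n (y_n - t_n) phi_n of the cross-entropy error E(w)."""
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)
```

This gradient can be used with (stochastic) gradient descent, or with the Newton-Raphson scheme described below.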

Note that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable, because the magnitude of \(\vec{w}\) can grow without bound, driving the sigmoid towards a step function and assigning every training point posterior probability one. This singularity can be avoided by inclusion of a prior and finding a MAP (maximum a posteriori) solution for \(\vec{w}\), or equivalently by adding a regularization term to the error function.

optimization

See the treatment of iterative reweighted least squares below.

Iterative reweighted least squares

Newton-Raphson update

For logistic regression, there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. However, the error function is convex and can be minimized by an efficient iterative technique based on the Newton-Raphson optimization scheme, which uses a local quadratic approximation to the log likelihood function: \[\vec{w}^{(new)} = \vec{w}^{(old)}-\vec{H}^{-1}\nabla E(\vec{w})\] where \(\vec{H}\) is the Hessian matrix whose elements comprise the second derivatives of \(E(\vec{w})\) with respect to the components of \(\vec{w}\).

First apply the Newton-Raphson method to the linear regression model with the sum-of-squares error function. The gradient and Hessian of this error function are given by \[\begin{aligned} \nabla E(\vec{w}) &= \sum_{n=1}^{N}(\vec{w}^T\phi_n-t_n)\phi_n = \vec{\Phi}^T\vec{\Phi}\vec{w} - \vec{\Phi}^T\vec{t} \\ \vec{H} = \nabla\nabla E(\vec{w}) &= \sum_{n=1}^{N}\phi_n\phi_n^T=\vec{\Phi}^T\vec{\Phi}\end{aligned}\] where \(\vec{\Phi}\) is the \(N\times M\) design matrix, whose \(n^{th}\) row is given by \(\phi_n^T\). The Newton-Raphson update then takes the form \[\begin{aligned} \vec{w}^{(new)} &= \vec{w}^{(old)} -(\vec{\Phi}^T\vec{\Phi})^{-1}\{\vec{\Phi}^T\vec{\Phi}\vec{w}^{(old)} -\vec{\Phi}^T\vec{t} \} \\ &=(\vec{\Phi}^T\vec{\Phi})^{-1}\vec{\Phi}^T\vec{t}\end{aligned}\] which we recognize as the standard least-squares solution, recovered in a single step because the error function is quadratic.

Now apply the Newton-Raphson update to the cross-entropy error function for the logistic regression model. The gradient and Hessian of this error function are given by \[\begin{aligned} \nabla E(\vec{w})&=\sum_{n=1}^{N}(y_n-t_n)\phi_n =\vec{\Phi}^T(\vec{y}-\vec{t}) \\ \vec{H} &= \nabla\nabla E(\vec{w})= \sum_{n=1}^{N}y_n(1-y_n)\phi_n\phi_n^T=\vec{\Phi}^T\vec{R}\vec{\Phi}\end{aligned}\] where we have introduced the \(N\times N\) diagonal matrix \(\vec{R}\) with elements \[R_{nn}=y_n(1-y_n)\] We see that the Hessian is no longer constant but depends on \(\vec{w}\) through \(\vec{R}\); since \(0<y_n<1\) it is positive definite, so the error function is convex and has a unique minimum.

The Newton-Raphson update formula for the logistic regression model then becomes \[\begin{aligned} \vec{w}^{(new)} &= \vec{w}^{(old)} -(\vec{\Phi}^T\vec{R}\vec{\Phi})^{-1}\vec{\Phi}^T(\vec{y}-\vec{t}) \\ &=(\vec{\Phi}^T\vec{R}\vec{\Phi})^{-1}\{\vec{\Phi}^T\vec{R}\vec{\Phi}\vec{w}^{(old)} -\vec{\Phi}^T(\vec{y}-\vec{t}) \} \\ &=(\vec{\Phi}^T\vec{R}\vec{\Phi})^{-1}\vec{\Phi}^T\vec{R}\vec{z}\end{aligned}\] where \(\vec{z}\) is an \(N\)-dimensional vector with elements \[\vec{z} = \vec{\Phi}\vec{w}^{(old)}-\vec{R}^{-1}(\vec{y}-\vec{t})\] We see that the update formula takes the form of a set of normal equations for a weighted least-squares problem. Because the weighting matrix \(\vec{R}\) depends on the parameter vector \(\vec{w}\), we must apply the normal equations iteratively, each time using the new weight vector \(\vec{w}\) to compute a revised weighting matrix \(\vec{R}\). For this reason, the algorithm is known as iterative reweighted least squares, or IRLS. The elements of the diagonal weighting matrix \(\vec{R}\) can be interpreted as variances because the mean and variance of \(t\) in the logistic regression model are given by \[\begin{aligned} \mathbb{E}[t] &=\sigma(\vec{w}^T\phi) = y\\ var[t]&=\mathbb{E}[t^2] -\mathbb{E}[t]^2 = y(1-y)\end{aligned}\] where we have used the property \(t^2=t\) for \(t\in \{0,1\}\).
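A compact numpy sketch of the IRLS iteration just derived (the small `eps` terms guard against division by zero and an ill-conditioned solve; they are implementation conveniences, not part of the derivation):

```python
import numpy as np

def irls_logistic(Phi, t, n_iters=20, eps=1e-8):
    """Iterative reweighted least squares for logistic regression.

    Phi : (N, M) design matrix, t : (N,) binary targets in {0, 1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        R = y * (1 - y)                              # diagonal of the weighting matrix
        z = Phi @ w - (y - t) / np.maximum(R, eps)   # working targets z
        A = Phi.T @ (R[:, None] * Phi) + eps * np.eye(Phi.shape[1])
        w = np.linalg.solve(A, Phi.T @ (R * z))      # weighted least-squares solve
    return w
```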

Linearization

In fact, we can interpret IRLS as the solution to a linearized problem in the space of the variable \(a=\vec{w}^T\phi\). A local linear approximation of \(a\) as a function of \(y\), expanded around the current operating point \(\vec{w}^{(old)}\) and evaluated at the target value \(t_n\), gives \[\begin{aligned} a_n(\vec{w}) &\simeq a_n(\vec{w}^{(old)})+\dfrac{da_n}{dy_n}\bigg|_{\vec{w}^{(old)}}(t_n-y_n) \\ &=\vec{\phi}_n^T\vec{w}^{(old)}-\dfrac{(y_n-t_n)}{y_n(1-y_n)} =z_n\end{aligned}\] so the effective targets \(z_n\) of the weighted least-squares problem are linearized versions of the targets \(t_n\).

Multiclass logistic regression

In our discussion of generative models for multiclass classification, we saw that the posterior probabilities are given by a softmax transformation of linear functions of the feature variables, so that \[p(\mathcal{C}_k|\phi) = y_k(\phi) = \dfrac{\exp(a_k)}{\sum_j\exp(a_j)}\] where the 'activations' \(a_k\) are given by \[a_k = \vec{w}_k^T\phi\] In implementations the softmax is usually evaluated in a numerically stable form by subtracting the maximum activation before exponentiating: \[\begin{aligned} \sigma(\vec{z})_j &= \dfrac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)},\quad j=1,\dots,K \\ &=\dfrac{\exp(z_j-\max_k z_k)}{\sum_{k=1}^{K}\exp(z_k - \max_k z_k)} \\ \log \sigma(\vec{z})_j &= z_j-\max_k z_k-\log\sum_{k=1}^{K}\exp(z_k - \max_k z_k)\end{aligned}\]

Here we consider the use of maximum likelihood to determine the parameters \(\{\vec{w}_k\}\) of this model directly. The derivatives of \(y_k\) with respect to all of the activations \(a_j\) are given by \[\begin{aligned} \dfrac{\partial y_k}{\partial a_j} &=\dfrac{e^{a_k}I_{kj}\sum_i e^{a_i}-e^{a_k}e^{a_j}}{\left(\sum_i e^{a_i}\right)^2} \\ &=y_k(I_{kj}-y_j)\end{aligned}\] where \(I_{kj}\) are the elements of the identity matrix.

Next we write down the likelihood function, using the 1-of-\(K\) coding scheme. \[p(\vec{T}|\vec{w}_1,...,\vec{w}_K)= \prod_{n=1}^{N}\prod_{k=1}^{K}p(\mathcal{C}_k|\phi_n)^{t_{nk}} =\prod_{n=1}^{N}\prod_{k=1}^{K}y_{nk}^{t_{nk}}\] where \(y_{nk}=y_k(\phi_n)\), and \(\vec{T}\) is an \(N\times K\) matrix of target variables with elements \(t_{nk}\). Taking the negative logarithm then gives \[E(\vec{w_1},...,\vec{w_K})=-\ln p(\vec{T}|\vec{w}_1,...,\vec{w}_K) =-\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{nk}\] which is known as the cross-entropy error function for the multiclass classification problem.

Taking the gradient of the error function and making use of the derivatives of the softmax function (writing \(p_k\) for the predicted probability \(y_k\) of a single data point), we obtain \[\begin{aligned} \because \frac{\partial E_n}{\partial a_j} &=-\sum_kt_k\frac{\partial \log p_k}{\partial a_j}\\ &=-\sum_kt_k\frac{1}{p_k}\frac{\partial p_k}{\partial a_j}\\ &=-t_j(1-p_j)-\sum_{k\neq j}t_k\frac{1}{p_k}({\color{red}{-p_kp_j}})\\ &=-t_j(1-p_j)+\sum_{k\neq j}t_k({\color{red}{p_j}})\\ &=-t_j+\color{blue}{t_jp_j+\sum_{k\neq j}t_k({p_j})}\\ &=\color{blue}{p_j\left(\sum_kt_k\right)}-t_j \\ &=p_j-t_j \\ &=y_j-t_j \\ \\ \therefore \nabla_{w_j}E(\vec{w_1},...,\vec{w_K}) &= \sum_{n=1}^{N}(y_{nj}-t_{nj})\phi_n \\ \therefore \nabla_{\vec{W}}E(\vec{w}_1,...\vec{w}_K) &= (\vec{Y}-\vec{T})^T\vec{\Phi}\end{aligned}\] where \(\vec{Y}\) is the \(N\times K\) prediction matrix and \(\vec{T}\) is the \(N\times K\) target label matrix, and we have used \(\sum_k t_k = 1\).
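A sketch of the batch gradient for the multiclass model, using the stable log-sum-exp form of the softmax noted earlier (illustrative; `T` is the \(N\times K\) one-of-K target matrix):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)      # subtract the row maximum for stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_grad(W, Phi, T):
    """Gradient (Y - T)^T Phi of the multiclass cross-entropy error.

    W : (K, M) weights, Phi : (N, M) design matrix, T : (N, K) one-hot targets.
    """
    Y = softmax(Phi @ W.T)                    # (N, K) predicted probabilities y_nk
    return (Y - T).T @ Phi                    # (K, M), one row per w_k
```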

To find a batch algorithm, we evaluate the Hessian matrix, whose blocks are \[\nabla_{w_k}\nabla_{w_j}E(\vec{w_1},...,\vec{w_K}) =\sum_{n=1}^{N}y_{nk}(I_{kj}-y_{nj})\phi_n\phi_n^T\] As in the two-class case, the Hessian matrix is positive definite and so the error function has a unique minimum.

Probit regression

For a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities.

Consider the cumulative distribution function of the standard normal, given by \[\Phi(a) = \int_{-\infty}^{a}\mathcal{N}(\theta|0,1)d\theta\] which is known as the probit function. It has a sigmoidal shape. A closely related function is \[erf(a) = \dfrac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^2)d\theta\] known as the erf function or error function. The generalized linear model based on a probit activation function is known as probit regression.

One issue that occurs in practical applications is that of outliers. Because the tails of the probit activation function decay like \(\exp(-x^2)\), faster than those of the logistic sigmoid, the probit model can be significantly more sensitive to outliers. However, the effect of mislabelling is easily incorporated into a probabilistic model by introducing a probability \(\epsilon\) that the target value \(t\) has been flipped to the wrong value, \[\begin{aligned} p(t|\vec{x}) &= (1-\epsilon)\sigma(\vec{x})+\epsilon(1-\sigma(\vec{x}))\\ &=\epsilon + (1-2\epsilon)\sigma(\vec{x})\end{aligned}\] where \(\sigma(\vec{x})\) is the activation function with input vector \(\vec{x}\). Here \(\epsilon\) may be set in advance, or it may be treated as a hyperparameter whose value is inferred from the data.

Canonical link functions

We now show that there is a general result that holds if we assume a conditional distribution for the target variable from the exponential family, together with a corresponding choice for the activation function known as the canonical link function.

Making use of the restricted form [eqn:restricted exponential family] of exponential family distributions for the target variable \(t\), we have \[p(t|\eta,s)=\dfrac{1}{s}h(\dfrac{t}{s})g(\eta)\exp\{\dfrac{\eta t}{s}\}\] The conditional mean of \(t\), which we denote by \(y\), is given by \[y \equiv \mathbb{E}[t|\eta]=-s\dfrac{d}{d\eta}\ln g(\eta)\] Thus \(y\) and \(\eta\) must be related, and we denote this relation through \(\eta = \psi(y)\).

We define a generalized linear model to be one for which \(y\) is a nonlinear function of a linear combination of the input(or feature) variables so that \[y=f(\vec{w}^T\vec{\phi})\] where \(f(\cdot)\) is known as the activation function,and \(f^{-1}(\cdot)\) is known as the link function in statistics.

The log likelihood function for this model, considered as a function of \(\eta\), is given by \[\ln p(\vec{t}|\eta,s) = \sum_{n=1}^{N}\ln p(t_n|\eta,s) = \sum_{n=1}^{N}\{\ln g(\eta_n)+\dfrac{\eta_n t_n}{s} \} + const\] where we assume that all observations share a common scale parameter (which corresponds to the noise variance for a Gaussian distribution, for instance) and so \(s\) is independent of \(n\). The derivative with respect to the model parameters \(\vec{w}\) is then given by \[\begin{aligned} \nabla_w\ln p(\vec{t}|\eta,s) &=\sum_{n=1}^{N}\{\dfrac{d}{d\eta_n}\ln g(\eta_n)+\dfrac{t_n}{s} \} \dfrac{d\eta_n}{dy_n}\dfrac{dy_n}{da_n}\nabla a_n \\ &= \sum_{n=1}^{N}\dfrac{1}{s}\{t_n-y_n \}\psi'(y_n)f'(a_n)\phi_n\end{aligned}\] where \(a_n=\vec{w}^T\vec{\phi_n}\), and we have used \(y_n=f(a_n)\) together with the result above for the conditional mean. There is a considerable simplification if we choose a particular form for the link function \(f^{-1}(y)\) given by \[f^{-1}(y) = \psi(y)\] In this case, the gradient of the error function reduces to \[\nabla E(\vec{w}) = \dfrac{1}{s}\sum_{n=1}^{N}\{y_n-t_n\}\phi_n\] For the Gaussian \(s=\beta^{-1}\), whereas for the logistic model \(s=1\).

The Laplace Approximation

Model comparison and BIC

Bayesian Logistic Regression

Laplace approximation

Predictive distribution

